
Applications in Computer Vision

ric space distances to learn local features at increasingly large contextual scales, with novel set

learning layers that adaptively combine features from multiple scales under non-uniform point densities.

PointCNN [134] is introduced to learn an X-transformation from the input points that simulta-

neously weights the input features associated with the points and then permutes them into a

latent, potentially canonical order. Grid-GCN [256] takes advantage of the Coverage-Aware

Grid Query (CAGQ) strategy for point-cloud processing, which leverages the efficiency of

grid space. In this way, Grid-GCN improves spatial coverage while reducing theoretical time

complexity.
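The efficiency of grid space mentioned above comes from hashing points into voxel cells, which turns neighbor queries into constant-time cell lookups. The following is a minimal sketch of this grid-hashing idea only, not an implementation of Grid-GCN's CAGQ strategy itself; the function name `voxel_group` and the voxel size are illustrative assumptions.

```python
import numpy as np

def voxel_group(points, voxel_size=0.5):
    """Hash 3-D points into voxel-grid cells (illustrative sketch,
    not Grid-GCN's actual CAGQ). Returns {cell: [point indices]}."""
    # Integer cell coordinates for every point.
    cells = np.floor(points / voxel_size).astype(np.int64)
    groups = {}
    for idx, cell in enumerate(map(tuple, cells)):
        groups.setdefault(cell, []).append(idx)
    return groups

points = np.array([[0.1, 0.2, 0.0],
                   [0.4, 0.3, 0.1],
                   [1.2, 0.9, 0.0]])
groups = voxel_group(points)
# The first two points share a cell; the third falls elsewhere,
# so candidate neighbors can be gathered per cell instead of
# by a full pairwise distance search.
```

A coverage-aware strategy would then sample cells (rather than raw points) so that occupied regions of space are represented evenly.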

6.1.3 Object Detection

Deep Learning based object detection can generally be classified into two categories:

two-stage and single-stage object detection. Two-stage detectors, for example, Faster R-

CNN [201], FPN [143], and Cascade R-CNN [30], generate region proposals in the first

stage and refine them in the second. For localization, R-CNN [73] uses the L2 norm

between predicted and target offsets as the objective function, which can cause gradient ex-

plosions when errors are large. Fast R-CNN [72] and Faster R-CNN [201] therefore proposed a

smooth L1 loss, which keeps the gradient of large prediction errors constant. One-stage

detectors, e.g., RetinaNet [144] and YOLO [200], classify and regress objects concurrently,

which are highly efficient but suffer from lower accuracy. Recent methods [276, 202] improve

localization accuracy by using IoU (Intersection over Union)-related values

as regression targets. IoU Loss [276] directly uses the negative log of the IoU as the objective

function, which incorporates the dependency between box coordinates and adapts to multi-

scale training. GIoU [202] extends the IoU loss to non-overlapping cases by considering the

shape properties of the compared objects. CIoU Loss [293] incorporates more geometric

measurements, that is, overlap area, central point distance, and aspect ratio, and achieves

better convergence.
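The losses discussed above can be made concrete with a minimal numpy sketch. The smooth L1 loss is quadratic near zero and linear beyond a threshold, so its gradient stays constant for large errors; the GIoU loss penalizes via the smallest enclosing box, giving a useful signal even when boxes do not overlap. Function names and the `beta` threshold below are illustrative choices, not the exact formulations of the cited papers.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1: 0.5*x^2/beta for |x| < beta, |x| - 0.5*beta otherwise,
    so the gradient of large errors is bounded (no gradient explosion)."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def box_iou_union(a, b):
    """IoU and union area of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union, union

def giou_loss(a, b):
    """GIoU loss 1 - GIoU: subtracts the fraction of the smallest
    enclosing box C not covered by the union, so disjoint boxes
    still receive a gradient."""
    iou, union = box_iou_union(a, b)
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return 1.0 - (iou - (c - union) / c)
```

For two disjoint boxes the IoU is zero regardless of their distance, whereas `giou_loss` still grows as the enclosing box grows, which is exactly the non-overlapping case GIoU was designed to handle.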

6.1.4 Speech Recognition

Speech recognition is the automatic conversion of human speech into the

corresponding text by computer. Because of its broad application prospects, speech recognition

has become one of the most popular topics in academic research and industrial applica-

tions. In recent years, speech recognition has improved rapidly with the development of

deep convolutional neural networks (DCNNs). WaveNet [183] is one of the most advanced

frameworks for speech recognition. Given a target language and audio spectrograms, it can

both recognize speech and convert text to speech with high quality. The key to WaveNet's

naturally produced voice is its data-driven vocoder [178], which avoids the error introduced

when the speech spectrum and phase information are estimated separately and then combined

to recover the speech waveform. Instead of traditional speech recognition

applications on remote servers, speech recognition is gradually becoming popular on mo-

bile devices. However, the demand for abundant memory and computational resources

restricts full-precision neural networks: until this hardware deployment problem is solved,

DCNNs with huge numbers of parameters can neither be stored nor run on mobile devices.
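Part of why WaveNet-style models are so demanding is that they stack dilated causal convolutions whose receptive field grows exponentially with depth, requiring many layers and parameters. The following is a small sketch of the receptive-field arithmetic under the commonly described WaveNet configuration (kernel size 2, dilations doubling from 1 to 512 within a block); the function name is illustrative.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of dilated causal
    convolutions: each layer adds (kernel_size - 1) * dilation."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# One WaveNet-style block: kernel 2, dilations 1, 2, 4, ..., 512.
dilations = [2 ** i for i in range(10)]
rf = receptive_field(2, dilations)  # covers 1024 audio samples
```

Reaching even a fraction of a second of 16 kHz audio this way takes several such blocks, which is why full-precision deployment on mobile devices is so constrained.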